READ ME 
Guide to using scripts for slope analysis. 

Author: Fiorella Carla Grandi
Last revision: 05/2/2015

Table of Contents
1. List of Scripts
2. Necessary file formats
3. Calling CGIs
4. Mapping methylation data onto CGI
5. Selecting CGIs
7. Surrounding Shore CpG Analysis
8. Average Shore Methylation and Slope Calculation
9. Test data

1. List of scripts
CGImapper.pl
high_low_split.pl
Single_CGI_finder.pl
CGI_pair_finder.pl
CGI_surrounding.pl
CGI_pair_surrounding.pl
local_average.pl
global_pairs.pl
slope_calc.pl
Shore_DMRs.pl
Shore_DMRs_sum.pl

2. Necessary file formats
The scripts assume that your data are tab delimited and that CpG methylation is associated with the "C" base. Data that are not formatted this way can be converted using Galaxy or a custom script. Alternatively, change the delimiter used by the split operator in the scripts. Note that the scripts are set up to run each chromosome separately, i.e. they will not distinguish between methylation on chr1 and methylation on chr2. Make sure your files are named so that each chromosome can be identified.
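
As a rough illustration of the expected input (this is a Python sketch, not part of the Perl scripts), the snippet below splits one tab-delimited record into fields. The column order (position, % methylation, coverage) and the function name parse_cpg_line are assumptions made for the example; check your own files before relying on them.

```python
def parse_cpg_line(line, delimiter="\t"):
    """Split one CpG record into (position, percent_methylation, coverage).

    Assumed column order: position, % methylation, coverage.
    Change `delimiter` if your file is not tab delimited.
    """
    fields = line.rstrip("\n").split(delimiter)
    position = int(fields[0])
    methylation = float(fields[1])
    coverage = int(fields[2])
    return position, methylation, coverage


# One made-up record, for illustration only.
print(parse_cpg_line("10468\t85.7\t14\n"))
```

Passing delimiter="," would handle a comma-separated file in the same way.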

3. Calling CGIs
CGIs were designated using newcpgseek (see paper). The scripts assume that you have a tab-delimited list of CGIs which includes, for each island: start, end, number of CpGs and length.

4. Mapping methylation data onto CGI
use: CGImapper.pl

Data sets that you'll need:
A) CpG methylation. File should contain location, % methylation and sequencing coverage. It should be tab delimited. 
B) CGI definitions. File should contain start, end and number of CpGs for each island. It should be tab delimited. 

Usage: Replace the file names in the "Load data" sections (lines 28 and 61) with your own files. Or, if you want to run the test data set, leave them as is. Change the name of the output file on line 86.

From the command line, type: perl CGImapper.pl
The test data set takes XX minutes to run.
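
The mapping step can be pictured with the Python sketch below. It is not the code in CGImapper.pl; it only assumes that each CpG is assigned to the island whose start and end enclose it, and that a mean methylation is reported per island. The function name and the output tuple are hypothetical.

```python
from bisect import bisect_left, bisect_right


def map_cpgs_to_islands(cpgs, islands):
    """Assign CpGs to islands and average their methylation.

    cpgs:    list of (position, methylation), sorted by position
    islands: list of (start, end), both ends treated as inclusive (assumption)
    Returns one (start, end, n_cpgs, mean_methylation) tuple per island;
    mean_methylation is None when no CpG falls inside the island.
    """
    positions = [pos for pos, _ in cpgs]
    mapped = []
    for start, end in islands:
        lo = bisect_left(positions, start)   # first CpG at or after start
        hi = bisect_right(positions, end)    # one past the last CpG at or before end
        values = [meth for _, meth in cpgs[lo:hi]]
        mean = sum(values) / len(values) if values else None
        mapped.append((start, end, len(values), mean))
    return mapped


# Made-up data for one chromosome.
cpgs = [(100, 10.0), (150, 20.0), (300, 90.0), (900, 50.0)]
islands = [(90, 200), (250, 350), (400, 500)]
for row in map_cpgs_to_islands(cpgs, islands):
    print(row)
```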

5. Selecting CGIs
CGIs were segregated into 3 categories: single low, single high, and pairs, which can be low-low, low-high, or high-high.

To categorize the islands, first run high_low_split.pl. This will divide the islands into "low" and "high". The methylation percentage cutoffs can be changed by modifying lines 28 and 31. Specify the file name for the mapped CGIs (from #4) on line 18. Specify the file names for the low and high output files on lines 20 and 21. The script returns files in the same format as #4, but divided based on methylation.
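
A minimal Python sketch of the split is shown here for orientation. The cutoffs of 20% and 80% are placeholders invented for the example, not the values used in the script, and islands between the two cutoffs are simply dropped (also an assumption).

```python
def split_high_low(islands, low_max=20.0, high_min=80.0):
    """Split islands into low and high methylation groups.

    islands: list of (start, end, mean_methylation)
    low_max / high_min: hypothetical cutoffs in percent.
    Islands between the two cutoffs are left out of both groups.
    """
    low = [isl for isl in islands if isl[2] <= low_max]
    high = [isl for isl in islands if isl[2] >= high_min]
    return low, high


# Made-up islands: one low, one high, one intermediate.
islands = [(100, 500, 5.0), (900, 1400, 92.0), (2000, 2600, 50.0)]
low, high = split_high_low(islands)
print("low:", low)
print("high:", high)
```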
 
Then you can either find single CGIs or paired CGIs. 

FOR SINGLE CGIs: use Single_CGI_finder.pl
Input a complete list of CGIs on line 18 (these SHOULD NOT be low/high segregated). Input the low CGI file name on line 48 and the desired output file name on line 49. Input the high CGI file name on line 51 and the desired output file name on line 52. The required distance between one CGI and the next can be adjusted by changing the distance on lines 68 and 93.
The script will print out a list of CGIs in the same format as #4, but these will be either single low or single high.
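
The idea of a "single" CGI can be sketched in Python as follows. This is an illustration only: it assumes an island counts as single when no other island in the complete list lies within min_distance of it, that islands do not overlap, and that 2000 bp is a reasonable placeholder distance. None of these details are taken from Single_CGI_finder.pl.

```python
def find_single_islands(all_islands, candidates, min_distance=2000):
    """Return the candidate islands that have no neighbour within min_distance.

    all_islands: complete, unsegregated list of (start, end)
    candidates:  the low (or high) subset of the same islands
    min_distance: hypothetical minimum gap to the nearest neighbour, in bp
    """
    ordered = sorted(all_islands)
    isolated = set()
    for i, (start, end) in enumerate(ordered):
        # Gap to the previous and next island; None at the chromosome ends.
        left_gap = start - ordered[i - 1][1] if i > 0 else None
        right_gap = ordered[i + 1][0] - end if i + 1 < len(ordered) else None
        left_ok = left_gap is None or left_gap >= min_distance
        right_ok = right_gap is None or right_gap >= min_distance
        if left_ok and right_ok:
            isolated.add((start, end))
    return [isl for isl in candidates if isl in isolated]


# Made-up islands: the first two are close together, the others are isolated.
all_islands = [(1000, 1500), (1800, 2200), (10000, 10400), (20000, 20500)]
low_islands = [(1000, 1500), (10000, 10400)]
print(find_single_islands(all_islands, low_islands))
```

The complete list is needed so that a low island next to a high island is not mistaken for a single.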

FOR PAIRS: use CGI_pair_finder.pl
Note that you will need to run this script once for each of the possible combinations: low-low, low-high, high-low and high-high. The type of pair you are looking for is determined by the type of files that you provide on lines 15 and 17. Specify the name of the output file on line 19. You can change the pair distance on line 59. The script returns: location of the first CGI (start), length of the first CGI in the pair, location of the second CGI in the pair (start) and the distance between them.
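
For orientation, here is a hedged Python sketch of pair finding. It assumes a pair is a first island followed by the nearest downstream second island within max_distance, with distance measured from the end of the first to the start of the second and length computed as end minus start. These choices, the 5000 bp placeholder and the function name are assumptions, not details of CGI_pair_finder.pl.

```python
def find_pairs(first_islands, second_islands, max_distance=5000):
    """Pair each first island with the nearest downstream second island.

    first_islands, second_islands: lists of (start, end)
    max_distance: hypothetical maximum gap between the two islands, in bp
    Returns (first_start, first_length, second_start, distance) tuples.
    """
    seconds = sorted(second_islands)
    pairs = []
    for start, end in sorted(first_islands):
        for second_start, _second_end in seconds:
            gap = second_start - end
            if gap > 0:
                # First island found downstream; keep it only if close enough.
                if gap <= max_distance:
                    pairs.append((start, end - start, second_start, gap))
                break
    return pairs


# Made-up low and high islands for a low-high run.
low = [(1000, 1500), (9000, 9400)]
high = [(3000, 3600), (20000, 20500)]
print(find_pairs(low, high))
```

Swapping the two arguments gives the high-low run, which is why each combination needs its own call.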

7. Surrounding Shore CpG Analysis

FOR SINGLES: Low and high can be run together with the CGI_surrounding.pl script. Specify the methylation data file on line 20. Specify the high CGI file on line 44 and the low CGI file on line 47. Name the output files for high and low on lines 45 and 48, respectively. The script will output the start location of the island, an upstream/downstream indicator, the genomic position of the CpG, the distance to the CGI and the methylation status of the CpG.
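
The output described above can be illustrated with this Python sketch. It is not the Perl script: the shore window of 2000 bp, the use of the island start for upstream distances and the island end for downstream distances, and the function name are all assumptions for the example.

```python
def surrounding_cpgs(cpgs, islands, window=2000):
    """List the CpGs that flank each island within `window` bp.

    cpgs:    list of (position, methylation)
    islands: list of (start, end)
    Returns (island_start, side, cpg_position, distance_to_island, methylation).
    """
    rows = []
    for start, end in islands:
        for pos, meth in cpgs:
            if start - window <= pos < start:
                rows.append((start, "upstream", pos, start - pos, meth))
            elif end < pos <= end + window:
                rows.append((start, "downstream", pos, pos - end, meth))
    return rows


# Made-up data: one island with one CpG inside it and one on each side.
cpgs = [(800, 10.0), (1200, 5.0), (1700, 40.0), (5000, 90.0)]
islands = [(1000, 1500)]
for row in surrounding_cpgs(cpgs, islands, window=500):
    print(row)
```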

FOR PAIRS: Each set of pair files has to be run separately under the current system. Script: CGI_pair_surrounding.pl. Specify the methylation data file on line 19. Specify the pair file (from the previous script) on line 42 and choose the output file name on line 43. The script will output a file that includes: the location of the CGI, upstream/downstream position in the pair, location of the CpG, distance to the CGI, methylation of the CpG, and distance between the CGIs in the pair.

8. Average Shore Methylation and Slope Calculation

To find local averages: Local averages are used for the single CGIs. Use: local_average.pl
You will need to specify the file that you want to average on line 29 and the name for the temporary output file on line 30 AND on line 130. Results are printed to the terminal and give the interval start, the average methylation of the interval, the SD and the # of CGIs that went into that interval. You can change the step size by changing lines 55 and 56 and changing the subroutines to count by the appropriate step size for your interval.
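
The binning can be sketched in Python as below. This is an illustration under assumptions, not the logic of local_average.pl: points are binned by distance to the island with a step of 100, the SD is the population SD, and the count is the number of values in the bin.

```python
from collections import defaultdict
from statistics import mean, pstdev


def local_averages(points, step=100):
    """Average methylation in consecutive distance intervals.

    points: list of (distance_to_island, methylation)
    step:   interval width in bp (100 is a placeholder)
    Returns (interval_start, mean, sd, n) per non-empty interval.
    """
    bins = defaultdict(list)
    for distance, meth in points:
        bins[(distance // step) * step].append(meth)
    return [
        (start, mean(values), pstdev(values), len(values))
        for start, values in sorted(bins.items())
    ]


# Made-up shore CpGs: two in the first interval, one in the second.
points = [(10, 20.0), (90, 40.0), (150, 60.0)]
for row in local_averages(points):
    print(row)
```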

To find global average: Global averages are used for the paired CGIs. 
use: global_pairs.pl 
There are a couple of parameters that you have to specify in this program:
1. Upstream or downstream (lines 28-29)
2. Distance range of the CGI pair (line 40)
You can also change the step size from 100 on lines 64 and 65 and change the number of subroutine iterations accordingly.
The script will print the start and end of each interval, the global average, the SD and the number of CGIs to the terminal.
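
A hedged Python sketch of how the two parameters could act as filters before averaging is given here. The row layout, the default pair distance range and the use of the population SD are assumptions for illustration and are not taken from global_pairs.pl.

```python
from collections import defaultdict
from statistics import mean, pstdev


def global_average(rows, side="upstream", pair_range=(0, 5000), step=100):
    """Average shore methylation over all pairs that pass the filters.

    rows: list of (side, cpg_distance, methylation, pair_distance)
    side: "upstream" or "downstream"
    pair_range: (min, max) distance between the paired islands, inclusive
    Returns (interval_start, interval_end, mean, sd, n) per non-empty interval.
    """
    bins = defaultdict(list)
    for row_side, cpg_distance, meth, pair_distance in rows:
        if row_side != side:
            continue
        if not pair_range[0] <= pair_distance <= pair_range[1]:
            continue
        bins[(cpg_distance // step) * step].append(meth)
    return [
        (start, start + step, mean(values), pstdev(values), len(values))
        for start, values in sorted(bins.items())
    ]


# Made-up rows; only the first two pass both filters.
rows = [
    ("upstream", 20, 10.0, 1000),
    ("upstream", 70, 30.0, 1200),
    ("downstream", 30, 90.0, 1000),
    ("upstream", 130, 50.0, 9000),
]
print(global_average(rows))
```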

To find the slopes: use slope_calc.pl
Specify the file that you want to use on line 19. Specify the output name on line 20. You can change the locations at which the slope is calculated by changing the values on lines 24-37.
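
As a sketch of what a slope between two locations means here, the Python snippet below takes the change in average methylation divided by the change in distance. How slope_calc.pl actually computes its slopes is not stated in this guide, so treat the formula and the function name as assumptions.

```python
def slope_between(averages, x1, x2):
    """Slope of average methylation between two distances from the island.

    averages: dict mapping interval start (bp) -> average methylation (%)
    x1, x2:   the two locations to compare; both must be keys of `averages`
    Returns the change in methylation per bp.
    """
    return (averages[x2] - averages[x1]) / (x2 - x1)


# Made-up averages at 0, 100 and 200 bp from the island edge.
averages = {0: 10.0, 100: 30.0, 200: 60.0}
print(slope_between(averages, 0, 200))
```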
 

9. Cumulative Difference between the Shores (DMRs)
Cumulative differences between the shores (i.e. DMRs) were found using Shore_DMRs.pl and Shore_DMRs_sum.pl. Shore_DMRs.pl takes the files generated by the CGI_surrounding.pl script and calculates the difference in methylation between each pair of points. The cumulative sum is then calculated in Shore_DMRs_sum.pl.
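
The two steps can be pictured with the Python sketch below. It assumes that "each pair of points" means consecutive points ordered by distance from the island, which this guide does not confirm; the function names are hypothetical and this is not the code in the Shore_DMRs scripts.

```python
from itertools import accumulate


def pairwise_differences(values):
    """Difference in methylation between each pair of consecutive points."""
    return [b - a for a, b in zip(values, values[1:])]


def cumulative_sum(differences):
    """Running total of the differences."""
    return list(accumulate(differences))


# Made-up methylation values ordered by distance from the island.
values = [10.0, 30.0, 25.0, 60.0]
diffs = pairwise_differences(values)
print(diffs)
print(cumulative_sum(diffs))
```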
 